Mobile video search

ABSTRACT

A facility for using a mobile device to search video content takes advantage of computing capacity on the mobile device to capture input through a camera and/or a microphone, extract an audio-video signature of the input in real time, and perform progressive search. By extracting a joint audio-video signature from the input in real time as the input is received and sending the signature to the cloud, where similar video content is searched through layered audio-video indexing, the facility can provide progressive results of candidate videos as the signature capture progresses.

COPYRIGHT NOTICE AND PERMISSION

A portion of the disclosure of this patent document may contain material that is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent files or records, but otherwise reserves all copyright rights whatsoever. The following notice shall apply to this document: Copyright © 2013, Microsoft Corp.

BACKGROUND

Mobile devices with access to the Internet and the World Wide Web have become increasingly common, serving as personal Internet-surfing concierges that provide users with access to ever increasing amounts of data while on the go.

Mobile devices do not currently provide a platform that is conducive to some types of searching, in particular searching video content without expending the resources to record and send the recording of the search subject as a query.

Some search applications for mobile devices support photographs taken with a camera built into the mobile device as a visual query, which is called capture-to-search. In capture-to-search, typically a picture is first snapped. Then that snapshot is submitted as the query to search for a match in various vertical domains. Other search applications support audio recorded from a microphone built into the mobile device as an audio query. For example, INTONOW allows users to record audio for use as a query. However, that sound is recorded for a period of up to about 12 seconds. Then that sound recording is submitted as a query to search for a match in various vertical domains. This process does not work well if the recording conditions are noisy, or in the case of a video without sound such that the recording is silent.

Some search engines for audio files use an even longer recording time. However, typical audio search engines do not search for audio in combination with video, and they still require that the actual recording be submitted as the query.

Yet other search applications support video images taken with a camera built into the mobile device as a visual query, which can be called video capture-to-search. VIDEOSURF is an example of video capture-to-search. In VIDEOSURF, a video image is captured for a period of at least 10 seconds and stored. A user then chooses the discriminative visual content for search, and then that video image clip is submitted as a query to search for a matching video.

Existing mobile video search applications expend significant resources to store a relatively long audio and/or video clip and to send the recorded clip to the search engine. Once the search engine receives the recorded video clip query, the search engine can perform matching based on the clip. The existing methods require a clip of fixed duration, e.g., 10 or 12 seconds.

Most research related to video search on mobile devices has focused on compact descriptor design on mobile devices. The most popular way to solve this problem is compressing descriptors through the technology of image coding for near-duplicate video search, which can be classified into three categories according to the type of data modality they rely on: audio-based, video-based, and fusion-based methods. However, most existing approaches to near-duplicate video search predominantly focus on desktop scenarios, where the query video is usually a subset of the original video without significant distortion, rather than video captured by the mobile device. Moreover, the computational costs and compactness of descriptors are often neglected in the existing approaches because conventional approaches to duplicate video search do not take the aforementioned mobile challenges into account. Conventional approaches to duplicate video search are not suitable for mobile video search.

SUMMARY

This document describes a facility for video search on a mobile device that takes advantage of computing resources available on the mobile device to extract audio and video characteristics of video content being presented by a device other than the mobile device and to send the characteristics as a query rather than sending a recording of the video content as the query. By extracting audio and video characteristics for use as a search query, and by matching the audio and video characteristics to audio and video characteristics stored in an indexed dataset of video content, the facility provides candidate videos for each audio and video characteristic submitted, including when the characteristics are extracted in noisy, poorly lit, or inconsistent conditions. The facility provides for presentation of an indication of candidate videos while additional portions of video input are being obtained and for progressive refinement of the candidate videos to be indicated. The facility provides a listing of the candidate videos, including revising the listing of candidate videos being provided while additional portions of video input are being obtained, until a selection is made from the candidate videos being provided or until the results list of candidate videos stabilizes, e.g., the results list of candidate videos ceases to change for a period of time and the search stops. The facility provides for a different presentation of an indication of candidate videos in response to the results list of candidate videos stabilizing, e.g., ceasing to change for a period of time. The facility also provides for presentation of an additional interface in response to a selection being made from the candidate videos being provided; for example, the facility provides for a browser opening to allow a user to buy or rent the selected video, to allow the user to see additional or auxiliary information about the selected video, or to allow the user to save an indication of the video for later viewing.

This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter. The term “techniques,” for instance, may refer to method(s) and/or computer-executable instructions, module(s), algorithms, hardware logic (e.g., Field-programmable Gate Arrays (FPGAs), Application-Specific Integrated Circuits (ASICs), Application-Specific Standard Products (ASSPs), System-on-a-chip systems (SOCs), Complex Programmable Logic Devices (CPLDs)), and the term “facility,” for instance, may refer to hardware logic (e.g., Field-programmable Gate Arrays (FPGAs), Application-Specific Integrated Circuits (ASICs), Application-Specific Standard Products (ASSPs), System-on-a-chip systems (SOCs), Complex Programmable Logic Devices (CPLDs)), other device(s), and/or other system(s) as permitted by the context above and throughout the document.

BRIEF DESCRIPTION OF THE DRAWINGS

The detailed description is described with reference to the accompanying figures. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. The same numbers are used throughout the drawings to reference like features and components.

FIG. 1 is a pictorial diagram of an example implementation of mobile video search using a mobile device capturing audio-video from a video presentation on a television.

FIG. 2 is a pictorial diagram of part of an example user interface of mobile video search from the embodiment of FIG. 1.

FIG. 3 is a pictorial diagram of an example architecture for implementing mobile video search.

FIG. 4 is a block diagram that illustrates select components of example mobile devices configured for mobile video search.

FIG. 5 is a block diagram that illustrates select components of example server devices configured for layered audio-video indexing, which can be employed for mobile video search.

FIG. 6 is a pictorial diagram of an example framework of a system including a mobile device implementing mobile video search and a server implementing layered audio-video indexing.

FIG. 7 is a pictorial diagram of an example of extraction of an audio fingerprint.

FIG. 8 is a pictorial diagram of an example of a layered audio-video index.

FIG. 9 is a flow diagram of an example process for implementing mobile video search on a client device.

FIG. 10 is a flow diagram of an example process for implementing video search on a server using a layered audio-video index.

FIG. 11 is a flow diagram that illustrates an example implementation of progressive processing during video search on a server using a layered audio-video index.

DETAILED DESCRIPTION

A mobile video search tool provides a rich set of functionalities to obtain relevant results for video search. Compared to a desktop computer, which predominantly supports search employing snippets of actual video files, a mobile device provides a rich set of interfaces for user interaction that can be employed to ease use and obtain results in a variety of environments. For example, beyond file upload and download and traditional keyboard and mouse inputs received in the desktop environment, mobile devices are enabled to receive additional multi-modal inputs. Mobile device interfaces can combine visual modality via a built-in camera and audio modality via a built-in microphone.

While mobile devices can combine such input modalities, video search from mobile devices faces a variety of challenges. For example, one of the challenges faced by mobile video search is that a search may be desired even though the user is in less than ideal conditions. The environment may be noisy, inconsistently lit or with fluctuating lighting, and/or subject to fluctuating speeds of internet connection. In contrast, video search from desktop computers typically includes submitting a snippet of the video file rather than a recording of a presentation of the video on another device as captured with a camera and/or microphone.

Other challenges faced by mobile video search include hardware limitations of mobile devices. The processor, e.g., Central Processing Unit (CPU) and Graphics Processing Unit (GPU), and the memory of mobile devices are still not comparable with desktop computers. Stringent memory and computation constraints make signatures with large memory costs or heavy computation unsuitable for mobile clients. Additional challenges include the negative effects of network and bandwidth limitations. With mobile devices, network connection is often unreliable and bandwidth is relatively low. In mobile video search as described herein, the effects of bottlenecks and dropped connections can be limited by using compact signatures to reduce the volume of data communicated over the network and ultimately to reduce network latency. In addition, users of mobile devices for search are sensitive to search latency. Presentation of preliminary results, including results from progressive search, while a shorter than conventional query clip is being captured reduces apparent latency for the user.

For example, a user may be walking to a meeting and notice a video presentation in a storefront window. Even though the user does not have time to stop and watch the video, the user may capture a few seconds of the video using the tool until the tool returns a matching video. The user may save the name of the video for later viewing. In this example, a client-side tool captures audio from the presentation and video images from the presentation and performs a lightweight transformation on the captured content. The transformation includes extracting an audio fingerprint and extracting visual hash bits, even in noisy street conditions. The relatively limited memory and computing resources of a mobile device compared to a desktop computer, for example, can make it infeasible to extract computationally expensive signatures to represent the video clip. Moreover, the bandwidth to send the video clip to a server for processing may not be available, or the duration of transmission may be unacceptably long. By employing the computing capacity on the mobile device, the tool can perform the transformation and transmit a much lower amount of data over the network. For example, the extraction of an audio fingerprint may result in approximately 0.5 KB of data for a second of video. Similarly, the extraction of visual hash bits from the video may result in approximately 1.0 KB of data for a second of video. Thus, an audio-video signature of these combined characteristics can be sent for less than 2 KB of data, compared to the amount of data needed to send the entire second of video clip. Moreover, because of the decreased latency of the retrieval system, possible matches can be returned while the video input is still being obtained, such as for progressive presentation of candidate results. When no additional candidate video matches are being obtained, or the results list does not change for a period of time, e.g., 3 seconds, a video matching the query has been identified, the search can automatically stop, and the user interface can be changed to reflect the stabilized list of candidate results.

Aspects of a mobile video search tool as described herein can be implemented as a search application running on the mobile device and/or via an application programming interface (API). The mobile video search tool can capture the video input for query and perform extraction of the audio fingerprint and visual hash bits to form the audio-video signature. In the case of an application running on the mobile device, the application can send the audio-video signature as the video search query. In the case of an API, the application can expose the audio fingerprint and visual hash bits making up the audio-video signature via an API for another application to use for video search.

In the cloud, the system is able to index large-scale video data using a novel Layered Audio-VidEo (LAVE) indexing scheme; while on the client, the system extracts light-weight joint audio-video signatures in real time and searches in a progressive way. The LAVE scheme combines audio-video signatures through joint multi-layered audio-video indexing, which preserves each signature's individual structure in the similarity computation and considers their correlation in the combination stage. The joint audio-video signature is computationally cheap for mobile devices and reinforces the discriminative power of the individual audio and visual modalities. Thus the audio-video signature is robust to large variances, e.g., noise and distortion in the query video. In various embodiments, a learned hash function significantly reduces the number of bits to transfer from the mobile device over a network, such as to a server or the cloud. A two-part graph transformation and matching algorithm makes the video search progressive, which means the search can stop when a stable result is achieved. As described herein, a result is stable when the results do not change for a period of time, e.g., for three seconds. In at least one implementation, the system described herein achieved more than 90%, e.g., 90.77%, precision when the query video was less than 10 seconds and about 70%, e.g., 70.07%, precision when the query video was less than 5 seconds.

As described herein, a server or cloud computing environment, which may also be referred to as a network-distributed environment, can host a layered audio-video index of video content upon which the search is run. Similar to the description of acquisition of an audio fingerprint and visual hash bits to obtain an audio-video signature, the server or cloud computer can perform extraction of audio-video signatures on video files from a library of video files. The extracted audio-video signatures can be stored as a layered audio-video index, which can reduce search latency compared to other search structures.

In various embodiments, searching the LAVE index includes a multi-step process. In at least one embodiment, first, the video search engine uses the audio fingerprint from the query as a filter. Second, the video search engine compares key frames from the filtered set for similarity. Third, the video search engine performs geometric verification to obtain the closest results. The video search engine may rank the closest results, and the video search engine may update the closest results and/or the ranking as additional audio-video signatures are run from the query. The video search engine can send representations of the candidate result videos toward the mobile device from which the query originated. In some embodiments, the candidate results may be presented in a user interface shared with the audio-video capture presentation while it is ongoing. In at least one embodiment, the candidate results can be presented progressively in the user interface shared with the audio-video capture presentation while capture of the video input for query and extraction of the audio fingerprint and visual hash bits to form the audio-video signature are occurring. In the event the results list stabilizes, the capture aspect can end and the user interface can transition to a presentation of a search result list of the stable listing of candidate results, with or without additional information.
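The following is a minimal, self-contained sketch of this multi-step flow, written in Python for illustration only. The frame fields (video_id, audio_hashes, visual_bits), the flat list of key frames, and the simple Hamming-distance scoring are assumptions of the sketch and not the described engine's API; the actual index is the layered LAVE structure described below.

    from collections import defaultdict

    def hamming(a, b):
        """Hamming distance between two equal-length bit sequences."""
        return sum(x != y for x, y in zip(a, b))

    def lave_search(query_audio_hashes, query_visual_bits, index_frames, top_k=5):
        # Step 1: audio fingerprint as a coarse filter -- keep key frames whose
        # video shares at least one audio hash with the query.
        candidates = [f for f in index_frames
                      if query_audio_hashes & f["audio_hashes"]]

        # Step 2: compare key frames from the filtered set for similarity,
        # here by Hamming distance over the visual hash bits.
        candidates.sort(key=lambda f: hamming(query_visual_bits, f["visual_bits"]))

        # Step 3: geometric verification would refine the closest results here;
        # it is omitted from this sketch (see 634 of FIG. 6).
        closest = candidates[:top_k]

        # Step 4: rank candidate videos by how many key frames survived; the
        # caller re-runs this as additional signatures arrive and merges rankings.
        votes = defaultdict(int)
        for frame in closest:
            votes[frame["video_id"]] += 1
        return sorted(votes, key=votes.get, reverse=True)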

In at least one embodiment, the mobile video search techniques described herein are implemented in a network-distributed environment. The network-distributed environment may include one or more types of computing resources, which types of computing resources may include computing, networking, and/or storage devices. A network-distributed environment may also be referred to as a cloud-computing environment.

Aspects of various embodiments are described further with reference to FIGS. 1-11.

Example Implementation

FIG. 1 shows an implementation of an example embodiment of mobile video search using a mobile device as described herein. In the illustrated example, a user 102 is using a mobile computing device 104 such as a tablet or smartphone. In FIG. 1, the mobile computing device 104 is shown with a user interface representative of capturing audio and visual input from a video presentation 106 on a television 108 via a mobile video search tool 110 while presenting a list of candidate results 112. In at least one embodiment, the list of candidate results 112 can be calculated in real time, or near real time, and returned to the client as a progressive list of candidate results 112. Candidate images and/or candidate text associated with candidate results can be presented in listing 112 in a user interface on the screen of mobile device 104. In the illustrated example, mobile device 104 represents a Windows Phone® device, although other mobile phones, smartphones, tablet computers, and other such mobile devices may similarly be employed. On mobile device 104, activation of a hard or soft button can indicate a desire to initiate mobile video search tool 110.

In the example implementation of FIG. 1, mobile video search tool 110 is shown capturing audio input via a microphone of the mobile device, as represented by the microphone graphic 114, although in other implementations audio capture may be represented by a different graphic or simply understood without a corresponding graphic. Meanwhile, mobile video search tool 110 is capturing video input via a camera of the mobile device, as is apparent from the user interface displaying the visual capture 116. While the mobile video search tool continues to capture audio input and visual input, the mobile video search tool can extract an audio fingerprint of the audio input and visual hash bits of the visual input to send toward the cloud for use in searching, for example to search a LAVE-indexed dataset. In addition, while the mobile video search tool continues to capture audio input and visual input, the mobile video search tool can receive a progressive list of candidate search results 112. Candidate images and/or candidate text associated with candidate results can be presented in listing 112 in a user interface on the screen of mobile device 104. In the illustrated embodiment, a progressive list of candidate search results 112 including candidate images and candidate text is presented beside the visual capture in the user interface, although other presentation locations are contemplated.

In various embodiments, to optimize memory, the mobile device 104 does not store the audio input or visual input; instead, the mobile device 104 stores the audio fingerprint and the visual hash bits. Storing the audio fingerprint and visual hash bits can be useful for low or inconsistent bandwidth conditions, or times when the device lacks a network connection.

Previously, global features have been adopted for searching near-duplicate videos, where videos are represented by compact global signatures. Such global features have included a spatiotemporal feature that leverages gray-level intensity distribution with respect to the timeline to represent videos and a combination of spatial and temporal information to construct invariant global signatures. Although these global representations achieve fast retrieval speeds in a large-scale video dataset, they do not accommodate recorded query videos with serious distortions.

Compared with global features, local descriptors are more distinctive and robust to recorded query video distortions because they exploit local invariance, such as scale and orientation. However, due to the computational complexity, employing local descriptors for recorded query videos that may contain distortions becomes intractable. Several approaches have attempted to improve the speed of local descriptor matching, including Bag-of-Words (BoW) and construction of a hierarchy structure to speed up the matching process. However, local descriptor based approaches require extensive optimization to operate on mobile devices due to the limited computing capability and memory of mobile devices.

Audio can play an important role in near-duplicate video searching. One example employs a landmark-based audio fingerprint to conduct a similar audio search, and another example includes a bag of audio words (BoA) representation, inspired by BoW, to characterize audio features for similar video search. Compared to visual features, audio features can be more robust, computationally efficient, and compact, which makes audio features suitable to employ in mobile video search.

Recently, joint audio-visual near-duplicate video search has been applied for large-scale video copy detection. The key problem of feature combination is the identification of the correlation between audio and video features. Existing fusion strategies include early fusion and late fusion. Both early fusion and late fusion strategies have disadvantages. For example, early fusion does not preserve structural information of individual features, while late fusion does not recognize correlation among features.

Existing early fusion and late fusion methods cannot sufficiently mine the advantage of audio-video signatures, so existing near-duplicate video search methods cannot be directly adapted for mobile video search to deal with the unique mobile challenges.

FIG. 2 is a pictorial diagram of the example stabilized results listing 200 in the user interface of mobile video search of the embodiment of FIG. 1.

Compared to the above methods, the mobile video search techniques and facility as described herein provide progressive mobile video search while video input is being captured. The mobile video search scheme progressively transmits compact audio-video signatures, which can be derived from audio fingerprints and visual hash bits, to the cloud. The LAVE indexing technique exploits the advantage of the audio-video signature for robust video search. Moreover, to improve users' search experience, a progressive query process employs a two-part graph-based transformation and matching method.

Accordingly, in various implementations the mobile video search tool leverages audio input to help users accelerate a query by employing Landmark-Based Audio Fingerprinting (LBAF) to obtain audio fingerprints.

In an example implementation, candidate images associated with stabilized candidate results can be presented in a listing 200 in a user interface on the screen of mobile device 104, as shown at 204. Meanwhile, text associated with candidate results, e.g., titles, character names, etc., can be presented in the listing 200 in a user interface on the screen of mobile device 104, as shown at 206. In the example shown, a results listing includes candidate images 204 and corresponding titles 206 presented in a horizontal ribbon format, from which a particular candidate result can be selected by dragging onto a search area 202 or by touching or otherwise selecting either the image or text on the screen of mobile device 104. However, other formats are both possible and contemplated. For example, selection of a candidate image can cause a browser to open and provide an opportunity for a user to buy or rent a copy of the selection for viewing on the mobile device, and/or selection of a text or title can bring up information about the associated video or store the title, with or without the associated image, for later access.

Illustrative Architecture

The architecture described below constitutes but one example and is not intended to limit the claims to any one particular architecture or operating environment. Other architectures may be used without departing from the spirit and scope of the claimed subject matter. FIG. 3 is a pictorial diagram of an example architecture for implementing mobile video search.

In some embodiments, the various devices and/or components of environment 300 include one or more network(s) 302 over which a mobile computing device 304, which can correspond to mobile computing device 104 and is also referred to herein as a client device 304 or simply a device 304, may be connected to at least one server 306. The environment 300 may include multiple networks 302, a variety of devices 304, and/or a plurality of servers 306.

In various embodiments, server(s) 306 can host a cloud-based service or a centralized service particular to an entity such as a school system or a company. Embodiments support scenarios where server(s) 306 can include one or more computing devices that operate in a cluster or other grouped configuration to share resources, balance load, increase performance, provide fail-over support or redundancy, or for other purposes over network 302.

For example, network(s) 302 can include public networks such as the Internet, private networks such as an institutional and/or personal intranet, or some combination of private and public networks. Network(s) 302 can also include any type of wired and/or wireless network, including but not limited to local area networks (LANs), wide area networks (WANs), satellite networks, cable networks, Wi-Fi networks, WiMax networks, mobile communications networks (e.g., 3G, 4G, and so forth), or any combination thereof. Network(s) 302 can utilize communications protocols, including packet-based and/or datagram-based protocols such as internet protocol (IP), transmission control protocol (TCP), user datagram protocol (UDP), or other types of protocols. Moreover, network(s) 302 can also include a number of devices that facilitate network communications and/or form a hardware basis for the networks, such as switches, routers, gateways, access points, firewalls, base stations, repeaters, backbone devices, and the like.

In some embodiments, network(s) 302 can further include devices that enable connection to a wireless network, such as a wireless access point (WAP). Embodiments support connectivity through WAPs that send and receive data over various electromagnetic frequencies (e.g., radio frequencies), including WAPs that support Institute of Electrical and Electronics Engineers (IEEE) 802.11 standards (e.g., 802.11g, 802.11n, and so forth), and other standards.

Computer Readable Media

Computer-readable media, as the term is used herein, includes at least two types of computer-readable media, namely computer storage media and communications media.

Computer storage media includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules, or other data. Computer storage media includes tangible and/or physical forms of media included in a device and/or hardware component that is part of a device or external to a device, including but not limited to random-access memory (RAM), static random-access memory (SRAM), dynamic random-access memory (DRAM), phase change memory (PRAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), flash memory, compact disc read-only memory (CD-ROM), digital versatile disks (DVDs), optical cards or other optical storage media, magnetic cassettes, magnetic tape, magnetic disk storage, magnetic cards or other magnetic storage devices or media, solid-state memory devices, storage arrays, network attached storage, storage area networks, hosted computer storage or any other storage memory, storage device, and/or storage medium or memory technology, or any other non-transmission medium that can be used to store and maintain information for access by a computing device.

In contrast, communication media may embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave, or other transmission mechanism.

As defined herein, computer storage media does not include communication media exclusive of any of the hardware components necessary to perform transmission. That is, computer storage media does not include communications media consisting solely of a modulated data signal, a carrier wave, or a propagated signal, per se.

In various embodiments, mobile computing devices 304 include devices such as devices 304A-304E. Embodiments support scenarios where device(s) 304 can include one or more computing devices that operate in a cluster or other grouped configuration to share resources or for other purposes. Although illustrated as a diverse variety of mobile device types, device(s) 304 can be other mobile device types and are not limited to the illustrated mobile device types. Device(s) 304 can include any type of mobile computing device with one or multiple processor(s) 308 operably connected to an input/output interface 310 and computer-readable media 312. Devices 304 can include mobile computing devices such as, for example, smartphones 304A, laptop computers 304B, tablet computers 304C, telecommunication devices 304D, personal digital assistants (PDAs) 304E, and/or combinations thereof. Devices 304 can also include electronic book readers, wearable computers, automotive computers, gaming devices, mobile thin clients, terminals, and/or work stations. In some embodiments, devices 304 can be other than mobile devices and can include, for example, desktop computers and/or components for integration in a computing device, appliances, or another sort of device.

In some embodiments, as shown regarding device 304A, computer-readable media 312 can store instructions executable by the processor(s) 308 including an operating system 314, an engine for mobile video search 316, and other modules, programs, or applications 318 that are loadable and executable by processor(s) 308 such as a CPU or a GPU. Alternatively, or in addition, the functionality described herein can be performed, at least in part, by one or more hardware logic components. For example, and without limitation, illustrative types of hardware logic components that can be used include Field-programmable Gate Arrays (FPGAs), Application-Specific Integrated Circuits (ASICs), Application-Specific Standard Products (ASSPs), System-on-a-chip systems (SOCs), Complex Programmable Logic Devices (CPLDs), etc.

The computer-readable media 312 in various embodiments may include computer storage media, which in turn may include volatile memory, nonvolatile memory, and/or other persistent and/or auxiliary computer storage media as discussed above. Thus, computer-readable media 312, when implemented as computer storage media, includes tangible and/or physical forms of media included in a device and/or hardware component that is part of a device or external to a device, including but not limited to random access memory (RAM), static random-access memory (SRAM), dynamic random-access memory (DRAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), flash memory, compact disc read-only memory (CD-ROM), digital versatile disks (DVDs), optical cards or other optical storage media, magnetic cassettes, magnetic tape, magnetic disk storage, magnetic cards or other magnetic storage devices or media, solid-state memory devices, storage arrays, network attached storage, storage area networks, hosted computer storage or any other storage memory, storage device, and/or storage medium that can be used to store and maintain information for access by a computing device. However, computer-readable media 312, when implemented as computer storage media, does not include communications media consisting solely of propagated signals, per se.

Device(s) 304 can further include one or more input/output (I/O) interfaces 310 to allow a device 304 to communicate with other devices. Input/output (I/O) interfaces 310 of a device 304 can also include one or more network interfaces to enable communications between computing device 304 and other networked devices such as other device(s) 304 and/or server(s) 306 over network(s) 302. Input/output (I/O) interfaces 310 of a device 304 can allow a device 304 to communicate with other devices such as user input peripheral devices (e.g., a keyboard, a mouse, a pen, a game controller, an audio input device, a visual input device, a touch input device, a gestural input device, and the like) and/or output peripheral devices (e.g., a display, a printer, audio speakers, a haptic output, and the like). Network interface(s) can include one or more network interface controllers (NICs) or other types of transceiver devices to send and receive communications over a network.

Server(s) 306 can include any type of computing device with one or multiple processor(s) 320 operably connected to an input/output interface 322 and computer-readable media 324. In some embodiments, as shown regarding server(s) 306, computer-readable media 324 can store instructions executable by the processor(s) 320 including an operating system 326, a framework for a layered audio-video engine 328, and other modules, programs, or applications 330 that are loadable and executable by processor(s) 320 such as a CPU and/or a GPU. Alternatively, or in addition, the functionality described herein can be performed, at least in part, by one or more hardware logic components. For example, and without limitation, illustrative types of hardware logic components that can be used include Field-programmable Gate Arrays (FPGAs), Application-Specific Integrated Circuits (ASICs), Application-Specific Standard Products (ASSPs), System-on-a-chip systems (SOCs), Complex Programmable Logic Devices (CPLDs), etc.

The computer-readable media 324, when implemented as computer storage media, may include volatile memory, nonvolatile memory, and/or other persistent and/or auxiliary computer-readable storage media. Server(s) 306 can further include one or more input/output (I/O) interfaces 322 to allow a server 306 to communicate with other devices such as user input peripheral devices (e.g., a keyboard, a mouse, a pen, a game controller, an audio input device, a video input device, a touch input device, a gestural input device, and the like) and/or output peripheral devices (e.g., a display, a printer, audio speakers, a haptic output, and the like). Input/output (I/O) interfaces 322 of a server 306 can also include one or more network interfaces to enable communications between computing server 306 and other networked devices such as other server(s) 306 or devices 304 over network(s) 302.

In various embodiments, server(s) 306 can represent a cloud-based service or a centralized service particular to an entity such as a school system or a company. Server(s) 306 can include programming to send a user interface to one or more device(s) 304. Server(s) 306 can store or access a user profile, which can include information a user has consented to have the entity collect, such as a user account number, name, location, and/or information about one or more client device(s) 304 that the user can use for sensitive transactions in untrusted environments.

Example Mobile Device

FIG. 4 illustrates select components of an example mobile device 104 configured to provide a mobile video search facility as described herein. Example mobile device 304 includes a power supply 402; one or more processors 404, which can correspond to processor(s) 308 and can include microprocessors; and input interfaces corresponding to input/output interface 310, including a network interface 406, one or more cameras 408, one or more microphones 410, and in some instances an additional input interface 412, which can include a touch-based interface and/or a gesture-based interface. Example mobile device 304 also includes output interfaces corresponding to input/output interface 310, including a display 414 and in some instances additional output interface 416 such as speakers, a printer, etc. Network interface 406 enables mobile device 304 to send and/or receive data over network 302. Network interface 406 may also represent any combination of other communication interfaces to enable mobile device 304 to send and/or receive various types of communication, including, but not limited to, web-based data and cellular telephone network-based data. In addition, example mobile device 304 includes computer-readable media 418, which in some embodiments corresponds to computer-readable media 312. Computer-readable media 418 stores an operating system (OS) 420, a browser application 422, a mobile video search tool 316, and any number of other applications or modules 424, which are stored in computer-readable media 418 as computer-readable instructions and are executed, at least in part, on processor 404.

Browser application 422 represents any of a variety of applications that can be executed on mobile device 304 to provide a user interface through which web content available over the Internet may be accessed.

Other applications or modules 424 may include any number of other applications that are executable on the mobile device 304. Such other applications may include, for example, an email application, a calendar application, a transactions module, a music player, a camera application, a calculator, one or more games, one or more productivity tools, a messaging application, an accelerometer, and so on.

Mobile video search tool 316 includes one or more of audio extraction module 426, video extraction module 428, signature module 430, results module 432, user interface module 434, and any number of other mobile video search modules 436. Audio extraction module 426 can extract an audio fingerprint such as LBAF.

Video extraction module 428 employs a video descriptor that is robust to distortions such as motion, blur, and inconsistent lighting conditions and that can be extracted quickly. Video extraction module 428 can extract raw features such as Speeded-Up Robust Features (SURF) features as local video features. However, sending raw SURF features may cause a mobile device to consume an unacceptably high amount of energy, and transmission may take too long to be acceptable to users. In various embodiments the video extraction module uses hashing methods to compress the local features to hash bits, consistent with the light computation and memory resources of mobile computing device 104.

Signature module 430 operates consistent with, and may make up all or a part of, the programming to perform a LAVE search based at least on the audio fingerprint from audio extraction module 426 and/or visual hash bits from video extraction module 428.

User interface module 434 operates consistent with, and may make up all or a part of, the programming for operation of other mechanical and/or software user interface components of the mobile device 104. For example, user interface module 434, which can be executed by processor 404, can control the functions of a hard or soft selection button, a home screen button, a back button, and/or a start button in the context of the mobile video search tool 316. User interface module 434 enables presentation and selection of particular listings of the candidate results listings received by results module 432. For example, user interface module 434 provides for presentation and selection of particular candidate listings presented in a scrollable ribbon format on the screen of mobile device 104 as shown at 112 and/or 200.

In some embodiments, other interactive multi-modal image search components 436 can apply the context of other interactive data to perform a mobile video search. For example, other context data that can be used may include, but is not limited to, recent searches, messaging information, data that identifies recently accessed applications (e.g., browser search, movie listing apps, etc.), and so on.

Although illustrated in FIG. 4 as being stored on computer-readable media 418 of mobile device 304, in some implementations, mobile video search tool 316, or portions thereof, can be stored on one or more servers 306 and/or executed via a cloud-based implementation. In addition, in some implementations, mobile video search tool 316, or portions thereof, can be implemented using any form of computer-readable media that is accessible by mobile device 304. Furthermore, in some embodiments, one or more components of operating system 420, browser application 422, mobile video search tool 316, and/or other applications or modules 424 may be implemented as part of an integrated circuit that is part of, or accessible to, mobile device 304. Furthermore, although illustrated and described as being implemented on a mobile device 304, in some embodiments, the data access and other functionality provided by mobile video search tool 316 as described herein may also be implemented on any other type of computing device that is configured for audio and visual input and through which a user can perform a video search, including, but not limited to, desktop computer systems, gaming systems, and/or television systems.

Example Server Device

FIG. 5 is a block diagram that illustrates select components of an example server device 306 configured to provide layered audio-video indexing as a mobile video search facility as described herein. Example server 306 includes a power supply 502; one or more processors 504, which can correspond to processor(s) 320 and can include microprocessors; and input interfaces corresponding to input/output interface 322, including a network interface 506 and in some instances one or more additional input interfaces 508 such as a keyboard, soft keys, a microphone, a camera, etc. In addition to network interface 506, example server device 306 can also include one or more additional output interfaces 510 corresponding to input/output interface 322, including output interfaces such as a display, speakers, a printer, etc. Network interface 506 enables server 306 to send and/or receive data over a network 302. Network interface 506 may also represent any combination of other communication interfaces to enable server 306 to send and/or receive various types of communication, including, but not limited to, web-based data and cellular telephone network-based data. In addition, example server 306 includes computer-readable media 512, which in some embodiments corresponds to computer-readable media 324. Computer-readable media 512 stores an operating system (OS) 514, a LAVE index 516, a layered audio-video engine 328, and any number of other applications or modules 518, which are stored on computer-readable media 512 as computer-executable instructions and are executed, at least in part, on processor 504.

Other applications or modules 518 may include any number of other applications that are executable on the server 306. Such other applications may include, for example, an email application, a calendar application, a transactions module, a music player, a camera application, a calculator, one or more games, one or more productivity tools, a messaging application, an accelerometer, and so on.

Layered audio-video engine 328 includes at least one of audio extraction module 524, video extraction module 526, LAVE search module 528, geometric verification module 530, progressive query module 532, and decision module 534.

Although illustrated in FIG. 5 as being stored on computer-readable media 512 of server 306, in some implementations, layered audio-video engine 328, or portions thereof, can be stored on one or more additional servers 306 and/or executed via a cloud-based implementation. In addition, in some implementations, layered audio-video engine 328, or portions thereof, can be implemented using any form of computer-readable media that is accessible by server 306. Furthermore, in some embodiments, one or more components of operating system 514, LAVE index 516, and/or other applications or modules 518 may be implemented as part of an integrated circuit that is part of, or accessible to, server 306. Furthermore, although illustrated and described as being implemented on a server 306, in some embodiments, the data access and other functionality provided by layered audio-video engine 328 as described herein may also be implemented on any other type of computing device that is configured for audio and visual indexing and that can perform a video search based on video query input, including, but not limited to, desktop computer systems, head-end television distribution systems, and laptop computer systems.

FIG. 6, at 600, is a pictorial diagram of an example framework of a mobile device implementing mobile video search and a server or cloud computing environment, which may also be referred to as a network-distributed environment, implementing layered audio-video indexing. Framework 600 is illustrated with an offline stage 602 and an online stage 604. Framework 600 can include at least one server 606, which in various embodiments corresponds to server(s) 306, and may include, for example, a web server, an application server, and any number of other data servers. Meanwhile, framework 600 can include at least one client 608, which in various embodiments corresponds to device(s) 104 and/or 304.

In various embodiments, client 608 is representative of any type of mobile computing device configured to transmit and receive data over a network such as network 302. For example, client 608 may be implemented as a mobile phone, a smartphone, a personal digital assistant (PDA), a netbook, a tablet computer, a handheld computer, or another such mobile computing device characterized by reduced form factor and resource limitations.

In the offline stage 602, the power of cloud computing can be used to store a large-scale source video dataset 610, which may include many thousands of videos. At 612, a layered audio-video indexing application such as LAVE 328 extracts the audio-video descriptors for individual videos from large-scale source video dataset 610. Effective joint audio-video descriptors will be robust to the variance of query videos from complex mobile video capturing conditions (e.g., silent video or blurred video of low visual quality) in a mobile video search system. In various embodiments, joint descriptor selection is based, at least in part, on three characteristics: 1) robust to the variance of the recorded query videos, 2) cheap to compute on mobile devices, and 3) easy to index for mobile video search. In at least one embodiment, the LAVE application employs Landmark-Based Audio Fingerprinting (LBAF) to obtain audio fingerprints 614 and Speeded-Up Robust Features (SURF) to obtain visual hash bits 616. At 618, LAVE application 328 builds and stores a LAVE index 620 using these descriptors.

The online query stage 604 includes the following operations, which can be performed while a client device 608, such as device 304, captures query video clips 622: 1) Real-time extraction of light-weight audio-video descriptors on the mobile device 624. The mobile video search tool 316 sends the audio-video signature (including visual hash bits 626 and audio fingerprint 628) toward server 606. In various embodiments mobile video search tool 316 sends the signature at predetermined intervals, e.g., at an interval of two seconds, at an interval of one second, at an interval of one-half second, etc. 2) The server 606 receives the signature, e.g., the two-second signature, the one-second signature, the half-second signature, etc. As shown at 630, server 606 conducts the search for similar video key frames 632 through the LAVE index 620. 3) As shown at 634, server 606 uses geometric verification-based visual ranking to refine the search results. Geometric verification compares query characteristics 636 to source characteristics 638. For each match between a query, e.g., a one-second query, and source video key frames, one node in a two-part graph can represent the received query and another node can represent a candidate matching key frame from the source video. In the graph, an edge connects the query node to the candidate matching key frame node. 4) As shown at 640, server 606 performs a progressive query process via two-part graph transformation and matching to make the video search progressive. The particulars of progressive query process 640 are shown in Algorithm 1. For example, if a new query arrives, a new query node will be added at 636. Then, the edges of the two-part graph will be updated according to the returned result. During progressive query 640, if the number of edges of the two-part graph does not change, the similarity score of the matched video will not change; otherwise, the similarity score of the matched video will be updated.
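An illustrative sketch of this progressive update follows; it is not the patent's Algorithm 1. The class name, the edge representation, and the use of summed edge weights as the similarity score are assumptions made for the example.

    # Sketch of the progressive query idea at 640: query nodes and candidate
    # key-frame nodes form a two-part (bipartite) graph, and a video's
    # similarity score is recomputed only when its edge set changes.

    class ProgressiveQuery:
        def __init__(self):
            self.edges = {}    # video_id -> set of (query_id, frame_id, weight) edges
            self.scores = {}   # video_id -> cached similarity score

        def add_query_result(self, query_id, matches):
            """matches: iterable of (video_id, frame_id, weight) for one new query node."""
            changed = set()
            for video_id, frame_id, weight in matches:
                edge_set = self.edges.setdefault(video_id, set())
                before = len(edge_set)
                edge_set.add((query_id, frame_id, weight))
                if len(edge_set) != before:
                    changed.add(video_id)
            # Only videos whose edge count changed get their score updated.
            for video_id in changed:
                self.scores[video_id] = sum(w for _, _, w in self.edges[video_id])
            return sorted(self.scores, key=self.scores.get, reverse=True)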

At 642, if there are no changes in the search results and/or the similarity scores for a period of time, e.g., for a predetermined period of two consecutive seconds, three consecutive seconds, or four consecutive seconds, the decision module 534 determines that a stable search result has been achieved. In some embodiments, at 642, if there are no changes in the search results and/or the similarity scores for a period of time, e.g., for a variable period of time and/or a relative period of time, the decision module 534 determines that a stable search result has been achieved. When a stable search result is achieved, the search process can cease automatically, and at 644 the results will be returned to the mobile device. In some embodiments, when the search result is stable, the results are returned for presentation on client device 608 in a user interface that signals the search is stable, as shown at 646. However, when the search results are not stable for the desired time, the search process continues, and at 648 decision module 534 returns the results to the device 608 in a manner that indicates the search is not complete, as shown in the user interface during video capture 622.
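A minimal sketch of the decision step at 642 follows, assuming a fixed stability window measured in wall-clock seconds; the class and method names are illustrative only.

    import time

    class StabilityMonitor:
        """Stop the search once the result list has not changed for
        `window` seconds (e.g., 3 s)."""

        def __init__(self, window=3.0):
            self.window = window
            self.last_results = None
            self.last_change = time.monotonic()

        def update(self, results):
            """Record the latest candidate list; return True if it is stable."""
            if results != self.last_results:
                self.last_results = results
                self.last_change = time.monotonic()
            return (time.monotonic() - self.last_change) >= self.window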

In the illustrated example, a client device 608, such as mobile device 304, receives a video input via a microphone and camera to initiate a video query, as shown at 622. The system employs an audio extraction module such as module 426 to extract an audio fingerprint such as LBAF, as shown at 628. The system also employs a video extraction module such as module 428 to extract visual hash bits, as shown at 626.

With regard to visual hash bits, video extraction modules such as video extraction module 428 and/or video extraction module 526 can use hashing methods to compress the local features to hash bits. For example, the video extraction module can use Minimal Loss Hashing or Spectral Hashing to learn a hash function such as h^v = sign(v^T x − t), where x represents the SURF descriptor vector, v represents the learned hash matrix, and t represents the threshold scalar, to calculate h^v, which represents the learned visual hash bits. In some embodiments, the video extraction module can limit the binary code to 80 bits. In such embodiments, the video extraction module can use eight bits to save the angle value of the SURF descriptor, which will later be used for geometric verification as discussed regarding 634 of FIG. 6. Therefore, the video extraction module can compress each SURF feature to v_i = {h_i^v, r_i^v}, which in the discussed example can be just 88 bits.
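A brief sketch of this hash computation follows, assuming a pre-learned 64x80 projection matrix V (for example from Minimal Loss Hashing or Spectral Hashing) and a threshold t, which may be the scalar described above or a per-bit vector; the function name and the angle quantization are illustrative.

    import numpy as np

    def visual_hash_bits(surf_descriptor, V, t, angle):
        """Return 80 visual hash bits plus an 8-bit quantized angle (88 bits total)."""
        x = np.asarray(surf_descriptor, dtype=float)     # 64-D SURF descriptor
        bits = ((V.T @ x - t) > 0).astype(np.uint8)      # h^v = sign(V^T x - t), as 0/1
        angle_code = int(angle / (2 * np.pi) * 256) % 256  # 8-bit angle value r_i^v
        return bits, angle_code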

The video extraction module can scale the query image to a small picture to minimize differences due to different camera resolutions on various mobile devices. Scaling the query image to a small picture can improve feature extraction speed on the mobile device, and it can decrease the number of feature points that need to be transmitted. In several implementations, such scaling improves query speed with little influence on precision. For example, after the scaling, there is an average of 75 SURF points for one frame, which allows the mobile device to transmit less than 1 KB of visual features to the server for the frame.

FIG. 7 is a pictorial diagram of an example of extraction of an audio fingerprint. Among various audio features, LBAF is widely used in many near-duplicate video search methods. Its fast computation, efficient memory use, and translation invariance also make it suitable for mobile video search. In various implementations, an audio extraction module such as audio extraction module 426 and/or audio extraction module 524 extracts an audio fingerprint such as LBAF. At 702, the audio extraction module segments the audio information into short and partly overlapping frames of length f_ml and stride f_md. At 704, the audio extraction module calculates a spectrogram for each frame. At 706, the audio extraction module sets candidate peaks such as 708a and 708b on the spectrogram of the frame. In several embodiments the audio extraction module sets the candidate peaks on the spectrogram of the frame according to at least three criteria: higher energy content than all its neighbors, higher amplitude than its neighbors, and a density criterion. At 710, the audio extraction module chooses an anchor point 712 from the peaks and identifies a corresponding target zone 714 for the anchor point. Each anchor point 712 is sequentially paired with the candidate peaks in its target zone 714. The anchor point-candidate peak pairs may be called landmarks. Each landmark can be represented as l_i = {t_i^a, f_i^a, Δt_i^a, Δf_i^a}, where t_i^a and f_i^a are the time offset and the frequency of the anchor point, and Δt_i^a and Δf_i^a are the time and frequency differences between the anchor point and the paired point in the target zone. The audio extraction module can compress the fingerprint into l_i = {h_k^a, t_i^a}, where h_k^a is the hash value of f_i^a, Δt_i^a, and Δf_i^a. Different l_i may have the same h_k^a.
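The following is a deliberately reduced sketch of landmark extraction along these lines; the top-k peak picking stands in for the energy, amplitude, and density criteria above, the target zone is collapsed to a single frame for brevity, and the 25-bit hash uses Python's built-in hash rather than the scheme described.

    import numpy as np

    def lbaf_landmarks(samples, sample_rate, frame_len=0.256, frame_stride=0.032,
                       fan_out=3):
        """Return a list of (hash, anchor time) landmark pairs for an audio clip."""
        frame_n = int(frame_len * sample_rate)
        stride_n = int(frame_stride * sample_rate)
        landmarks = []
        for start in range(0, len(samples) - frame_n, stride_n):
            frame = samples[start:start + frame_n]
            spectrum = np.abs(np.fft.rfft(frame))        # spectrogram column
            peaks = np.argsort(spectrum)[-fan_out - 1:]  # crude top-k peak picking
            anchor = int(peaks[-1])                      # strongest peak as anchor
            t_a = start / sample_rate
            for peak in peaks[:-1]:                      # pair anchor with target-zone peaks
                df = int(peak) - anchor
                # landmark {t_a, f_a, dt_a, df_a} compressed to a <=25-bit hash h_k^a
                h = hash((anchor, 0, df)) & 0x1FFFFFF
                landmarks.append((h, t_a))
        return landmarks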

In one implementation, f_ml = 256 ms and f_md = 32 ms, with a limit on hash bits h_k^a of less than 25 bits. As there are 15 bits for t_i^a, the length of l_i is 40 bits. In at least one embodiment, for a one-second audio clip, the audio extraction module may choose 100 landmarks in total. Hence, the audio extraction module can reduce the amount of data to transmit to just 0.5 KB per second for audio fingerprinting.

In this example, through feature extraction, the mobile device obtains 100 audio feature points and 75 visual feature points, which through efficient compression represent less than 2 KB of audio-visual signatures per second of video content to be transmitted over the network.
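A back-of-the-envelope check of these figures, using the 40-bit landmarks and 88-bit visual features described above:

    # Per-second signature size from the counts and bit widths quoted above.
    audio_bits = 100 * 40    # 100 landmarks x 40 bits each  = 4,000 bits (~0.5 KB)
    visual_bits = 75 * 88    # 75 SURF points x 88 bits each = 6,600 bits (~0.8 KB)
    total_kb = (audio_bits + visual_bits) / 8 / 1024
    print(f"{total_kb:.2f} KB per second")   # ~1.29 KB, under the 2 KB figure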

FIG. 8 is a pictorial diagram of an example of a layered audio-video (LAVE) indexing scheme 800. As shown at 800, the LAVE scheme employs two layers 802 and 804. The first layer 802 represents an index entry, which contains a multi-index made up of audio index 806 and visual index 808. The second layer 804 represents visual hash bits, which includes second-layer visual indexing 810. The LAVE scheme uses the visual hash bits of second-layer visual indexing 810 for feature matching and combination. After the search in the first layer, the system can obtain refined similar visual feature points from the audio index and from the visual index. Accordingly, combination in this context includes fusing the refined similar visual feature points from the audio index and from the visual index together and selecting the most (top K) similar visual feature points from them.

There are two advantages to these structures: 1) the structures improve the visual point search speed by employing the hierarchical decomposition strategy, and 2) the structures exploit the complementary nature of audio and visual signals. The different indexing entries in the first layer 802 preserve the individual structure of audio and visual signatures. In the second layer 804, the combination of audio and visual features can be weighted by the Hamming distance of visual hash bits.

Building a LAVE Index

In various embodiments, in contrast to visual features, the audio feature can be highly compressed, for example with just 25 bits to represent each point. The compression allows the LAVE search module 528 to conduct a linear search of the audio index. To build a LAVE index 516, a process such as that shown at 618 can use the audio index as part of the first layer 802, and each bucket, e.g., 806 a, h_(k)^(a), in the audio index of the first layer can be associated with the second layer by the video ID, audio time offset t^(a), and key frame number t^(v), e.g., 806 aa, ID_(i), t_(i)^(a), and 806 aa′, ID_(i′), t_(i′)^(a), and so on for 806 b, e.g., h_(k+1)^(a), 806 c, e.g., h_(k+2)^(a), etc. Through the audio indexing, the layered audio-video engine 328 can refine the number of visual points to be searched in the second layer, which improves the search speed.
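A minimal sketch of building the first-layer audio index, assuming plain Python dictionaries stand in for the structures of FIG. 8; the record layout (video ID, audio time offset, key frame number) follows the description above, while the helper names are illustrative:

# Each audio-hash bucket in the first layer points into the second layer
# through (video ID, audio time offset t^a, key frame number t^v).
from collections import defaultdict

def build_audio_index(videos):
    """videos: iterable of (video_id, landmarks) where each landmark is
    (audio_hash, audio_time_offset, key_frame_number)."""
    audio_index = defaultdict(list)   # first layer: h_k^a -> bucket
    for video_id, landmarks in videos:
        for audio_hash, t_a, t_v in landmarks:
            audio_index[audio_hash].append((video_id, t_a, t_v))
    return audio_index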

However, when the audio information is significantly changed or missing, it can be difficult to find the closest neighbor in the second layer. Layered audio-video engine 328 uses a multi-index to solve this problem. Layered audio-video engine 328 indexes the hash bits from the second layer visual index by m different hash tables, which construct the visual index of the first layer. Layered audio-video engine 328 randomly selects the hash bits h_(n)^(sub) of the visual index in the first layer, e.g., 808 a h_(n)^(sub), 808 a′ h_(n′)^(sub), 808 b h_(n+1)^(sub), 808 b′ h_(n′+1)^(sub), 808 c h_(n+2)^(sub), 808 c′ h_(n′+2)^(sub), 808 d h_(n+3)^(sub), 808 d′ h_(n′+3)^(sub), 808 e h_(n+4)^(sub), 808 e′ h_(n′+4)^(sub), 808 f h_(n+5)^(sub), 808 f′ h_(n′+5)^(sub), etc., from the hash bits in the second layer. For a received visual point, entries that fall close to the query in at least one such hash table are considered neighbor candidates. Layered audio-video engine 328 then checks the candidates for validity using the second layer index 810, e.g., 810 a ID_(l), t_(m)^(v), 810 a′ h_(n)^(v), 810 a″ r_(i)^(v), 810 b ID_(l), t_(m)^(v), 810 b′ h_(n+1)^(v), 810 b″ r_(i+1)^(v), 810 c ID_(l), t_(m)^(v), 810 c′ h_(n+2)^(v), 810 c″ r_(i+2)^(v). In contrast to existing techniques, layered audio-video engine 328 employs m+1 multi-indexes: m visual indexes and one audio index. All the results refined by the m+1 multi-index are combined together in the second layer and the top N similar results are selected. The audio index reduces the number m needed for the visual index. In at least one implementation, the facility operates with one visual index.
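A minimal sketch of this multi-index candidate lookup: m hash tables, each keyed by a random subset of the visual hash bits, propose neighbor candidates that are later validated against the full hash bits held in the second layer. The subset size, m, and hash width are illustrative assumptions:

# Build m first-layer visual hash tables over random bit subsets of the
# second-layer hash bits, then collect candidates that collide with the
# query in at least one table.
import random

def build_visual_multi_index(second_layer, m=2, subset_size=16, bits=64):
    """second_layer: list of (entry_id, visual_hash_bits) pairs, where the
    hash is an integer with `bits` significant bits."""
    bit_subsets = [random.sample(range(bits), subset_size) for _ in range(m)]
    tables = [dict() for _ in range(m)]
    for entry_id, full_hash in second_layer:
        for table, subset in zip(tables, bit_subsets):
            key = tuple((full_hash >> b) & 1 for b in subset)
            table.setdefault(key, []).append(entry_id)
    return tables, bit_subsets

def candidate_neighbors(query_hash, tables, bit_subsets):
    # Entries close to the query in at least one table are candidates;
    # validity is checked later against the second layer.
    candidates = set()
    for table, subset in zip(tables, bit_subsets):
        key = tuple((query_hash >> b) & 1 for b in subset)
        candidates.update(table.get(key, []))
    return candidates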

Searching a LAVE Index

In various embodiments, the search process in LAVE indexing can be presented as follows. Let P_(a)={l₁, l₂, . . . , l_(M)} represent the received audio query points and P_(v)={v₁, v₂, . . . , v_(L)} represent the received visual query points. Through a search process, such as search process 630, LAVE search module 528 can return the top K visual points for each query visual point.

Step 1, for each audio point l_(m) in P_(a), LAVE search module 528 acquires the nearest approximate neighbors by a linear search in the audio index. Then LAVE search module 528 assigns the matching pairs to different candidate clusters C={c₁, c₂, . . . , c_(N)}. LAVE search module 528 assigns two pairs to the same cluster if their nearest approximate neighbors come from the same video.

Step 2, LAVE search module 528 reorders the clusters by temporal verification. For example, LAVE search module 528 can represent temporal distance by Δt to denote the time difference of the two LBAFs in a matching pair. The histogram of Δt can be computed for all pairs in c_(n), and the score of c_(n) equals h_(n)/M, where h_(n) represents the maximum value of the histogram. This score can also be used for similarity computation. Then the top K′ candidate clusters are chosen. The buckets associated with the top K′ candidate clusters in the second layer can be regarded as a subset.

Step 3, for each v_(l) in P_(v), LAVE search module 528 can obtain the K nearest approximate neighbors as follows: a) the top K approximate neighbors can be determined by linear search in the subset of the second layer; b) the multi-index indexing method can be used to search for another top K nearest neighbor points; c) the 2K nearest neighbor points can be reordered by similarity distance, and the top K nearest points can be selected.

Step 4, LAVE search module 528 can return the top K nearest visual points as the search results.
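A minimal sketch of Steps 1-4, assuming the index layouts and helper functions from the earlier sketches (build_audio_index, build_visual_multi_index, candidate_neighbors); the cluster scoring, temporal verification, and distance functions are simplified stand-ins, not the exact implementation:

# Step 1: coarse audio filtering; Step 2: temporal verification of
# clusters; Steps 3-4: merge subset search and multi-index candidates,
# rank by Hamming distance, and return the top K per visual query point.
from collections import defaultdict

def lave_search(audio_points, visual_points, audio_index,
                visual_tables, bit_subsets, second_layer, k=10, k_prime=5):
    """audio_points: (hash, time) pairs; visual_points: hash integers;
    second_layer: list of ((video_id, key_frame), visual_hash) pairs."""
    # Step 1: linear search of the audio index, clustered by source video.
    clusters = defaultdict(list)
    for audio_hash, t_query in audio_points:
        for video_id, t_a, t_v in audio_index.get(audio_hash, []):
            clusters[video_id].append(t_a - t_query)

    # Step 2: score each cluster by its most common time difference and
    # keep the top K' clusters; their buckets form the search subset.
    def cluster_score(deltas):
        counts = defaultdict(int)
        for d in deltas:
            counts[d] += 1
        return max(counts.values()) / max(len(audio_points), 1)
    top_clusters = sorted(clusters, key=lambda v: cluster_score(clusters[v]),
                          reverse=True)[:k_prime]
    subset = [(eid, h) for eid, h in second_layer if eid[0] in top_clusters]

    # Steps 3-4: for each visual query point, merge linear search over the
    # subset with multi-index candidates and keep the K nearest points.
    lookup = dict(second_layer)
    results = []
    for q_hash in visual_points:
        scored = [(bin(q_hash ^ h).count('1'), eid) for eid, h in subset]
        cand = candidate_neighbors(q_hash, visual_tables, bit_subsets)
        scored += [(bin(q_hash ^ lookup[eid]).count('1'), eid) for eid in cand]
        scored.sort(key=lambda pair: pair[0])
        results.append([eid for _, eid in scored[:k]])
    return results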

In summary, according to the process, LAVE search module 528 combines the audio and visual information in two stages. The first stage is Step 1 through Step 3.a. In this stage, mobile video search uses the highly compressed audio information as a coarse filter and the more discriminative visual information as the fine filter to improve the overall search speed. Furthermore, as the similarity is computed in separate layers, the combination stage can also preserve the individual structure of each signature. The second stage is Step 3.b through Step 4. In contrast to the first combination stage, which heavily depends on audio search accuracy, in the second stage the combination of audio and visual information can be weighted by the Hamming distance of visual hash bits. The two stages exploit the complementary nature of the audio and visual signals for robust mobile video search. Due to the m+1 multi-index, i.e., m visual indexes and one audio index, the computational complexity of searching the LAVE index can be based on the multi-index indexing method LAVE search module 528 uses to search for the nearest visual neighbor points.

Geometric Verification

In various embodiments, geometric verification such as geometric verification 634 by geometric verification module 530 can be presented as follows. Geometric verification can use the top N points with the Hough Transform method to get similar source key frames of the query, and a subsequent geometric verification (GV) 634 considering spatial consistency of local features can be used to reject false-positive matches. In order to reduce the time consumption of GV, geometric verification module 530 can employ a fast and effective GV-based ranking step to find the most similar image. In at least one implementation, the method utilizes the orientation of descriptors, such that the location information of the local features need not be transmitted over the network. The method hypothesizes that two matched descriptors of duplicate images should have the same orientation difference. So for two duplicate images, geometric verification module 530 calculates the orientation distance Δθ_(d) between each matched local feature pair. Then geometric verification module 530 quantizes all Δθ_(d) into C bins, e.g., C=10. Furthermore, geometric verification module 530 scans the histogram for a peak and sets the global orientation difference as the peak value. Geometric verification module 530 obtains the geometric verification score from the number of the pairs in the peak, which is normalized by the number of total pairs.
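A minimal sketch of this orientation-difference ranking step using NumPy; the bin count C=10 follows the example above, while the input format (one descriptor orientation in radians per matched pair) is an assumption made for illustration:

# Histogram the per-pair orientation differences into C bins; the peak is
# the global orientation difference, and the score is the fraction of
# matched pairs that fall in the peak bin.
import numpy as np

def geometric_verification_score(query_orients, match_orients, num_bins=10):
    """Each argument is an array of descriptor orientations (radians) for
    the matched local feature pairs, in corresponding order."""
    if len(query_orients) == 0:
        return 0.0
    delta = np.mod(np.asarray(match_orients) - np.asarray(query_orients),
                   2 * np.pi)
    hist, _ = np.histogram(delta, bins=num_bins, range=(0, 2 * np.pi))
    return hist.max() / float(len(delta))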

Progressive Query

In various embodiments, a progressive query process such as progressive query 640 is performed by progressive query module 532. In contrast to existing mobile video search systems (i.e., systems that search only after acquiring all the query data), a progressive query process as described herein can significantly reduce the query cost and improve users' search experience. Progressive query module 532 can advance to the next query and dynamically calculate retrieval results, for example after or in response to the arrival of each query. Search can cease when a stable result is achieved.

Algorithm 1 provides an example progressive query process for at least one embodiment.

ALGORITHM 1 Progressive Query Process
Input: a new query q_(k+1)
Output: top K nearest videos
 1: add q_(k+1) to Q
 2: search q_(k+1), get R_(k+1)
 3: add R_(k+1) to R
 4: for each s_(n,m) in R_(k+1) do
 5:   find the G_(i) that contains s_(n,m)
 6:   add edge q_(k+1) <-> s_(n,m) to E_(i)
 7: end for
 8: call W = VideoSimilarScore(G)
 9: return top K nearest videos
Procedure VideoSimilarScore(G)
 1: for each G_(i) in G do
 2:   if |E_(i)| has changed then
 3:     calculate the MSM M_(i)
 4:     if |M_(i)| > α then
 5:       update W_(i) = Sim(Q, V_(i), W_(i)^(a), W_(i)^(v))
 6:     end if
 7:   end if
 8: end for
 9: return W

In a layered audio-video system as described herein, the progressive query process can be implemented via a two-part graph transformation and matching algorithm. As shown in FIG. 6, for each matched query and source video, progressive query module 532 can use a two-part graph G={N, E} to represent the matching. In the two-part graph, a query node 636 can be represented by q_(k) ∈ Q and denotes the received query at time k, and a source node 638 can be represented by s_(n,m) ∈ S and denotes the mth key frame in source video V_(n). Let R_(k) denote all the returned similar key frames s_(n,m) of query q_(k). There will be an edge e_(k,m) ∈ E if s_(n,m) ∈ R_(k). After each second of searching, progressive query module 532 can update the two-part graph G_(i), and then the similarity score of the matching can be progressively calculated through G_(i).

Algorithm 1 illustrates one embodiment of particulars of the progressive query process. If a new query arrives, a new query node will be added, such as at 636. Then, the edges of the two-part graph will be updated according to the returned result. During progressive query 640, if the number of edges of the two-part graph does not change, the similarity score of the matched video will not change; otherwise, the similarity score of the matched video can be updated as follows: First, progressive query module 532 can calculate the Maximum Size Matching (MSM) M_(i) of G_(i). If |M_(i)|>α, progressive query module 532 can calculate a similarity score W_(i) according to equation 1.

$W_{i} = \mathrm{Sim}(Q, V_{i}, W_{i}^{a}, W_{i}^{v}) = \mathrm{Sim}_{a}(Q, V_{i}, W_{i}^{a}) + \mathrm{Sim}_{v}(Q, V_{i}, W_{i}^{v}) + \mathrm{Sim}_{t}(Q, V_{i}) \qquad (1)$

In equation 1, Sim_(a)(Q, V_(i), W_(i)^(a)) favors the audio content similarity, which can be computed according to equation 2.

$\mathrm{Sim}_{a}(Q, V_{i}, W_{i}^{a}) = \frac{\sum w_{k,i}^{a}}{|Q|} \qquad (2)$

In equation 2, w_(k,i)^(a) represents the audio similarity between query q_(k) and video V_(i), and |Q| represents the query length. Sim_(v)(Q, V_(i), W_(i)^(v)) indicates the visual similarity according to equation 3.

$\mathrm{Sim}_{v}(Q, V_{i}, W_{i}^{v}) = \frac{\sum w_{k,i}^{v}}{|Q|} \qquad (3)$

In equation 3, w_(k,i)^(v) represents the visual similarity between query q_(k) and video V_(i), and Sim_(t)(Q, V_(i)) shows temporal order similarity. This score assures that the matched video should have a similar temporal order. Given the MSM M_(i) of G_(i), its temporal matching number can be calculated by, for example, a Longest Common Subsequence (LCSS). LCSS is a variation of the edit distance, which progressive query module 532 can use to denote the number of frame pairs of M_(i) matched along the temporal order according to equation 4.

$\mathrm{LCSS}(i,j) = \begin{cases} 0 & \text{if } i = 0 \text{ or } j = 0 \\ \mathrm{LCSS}(i-1, j-1) + 1 & \text{if } e_{i,j} > 0 \\ \max\{\mathrm{LCSS}(i-1, j),\ \mathrm{LCSS}(i, j-1)\} & \text{if } e_{i,j} = 0 \end{cases} \qquad (4)$

Thus, Sim_(t)(Q, V_(i)) can be obtained according to equation 5.

$\mathrm{Sim}_{t}(Q, V_{i}) = \frac{\mathrm{LCSS}(Q, V_{i})}{|Q|} \qquad (5)$
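A minimal sketch of the LCSS recurrence of equation 4 and the temporal score of equation 5, assuming the edge indicator e[i][j] is supplied as a two-dimensional list where e[i][j] > 0 means query frame i and source key frame j are matched in the MSM:

# Standard dynamic-programming LCSS over the matched-frame indicator.
def lcss(e):
    rows, cols = len(e), len(e[0]) if e else 0
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            if e[i - 1][j - 1] > 0:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[rows][cols]

# Temporal similarity of equation 5: Sim_t(Q, V_i) = LCSS(Q, V_i) / |Q|.
def temporal_similarity(e, query_length):
    return lcss(e) / float(query_length)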

After computing all the similarities between Q and V, progressive query module 532 can return the top K videos as the search results. In various embodiments, the computational complexity of the progressive query process 640 as described herein is O(|G|×|N_(i)|×|E_(i)|), where |G| represents the number of two-part graphs, |N_(i)| represents the number of vertices, and |E_(i)| represents the number of edges in each two-part graph. However, in at least one implementation, the time consumed for the similarity calculation process is less than O(|G|×|N_(i)|×|E_(i)|) because |E_(i)| does not change in most two-part graphs.
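A minimal sketch of the per-second update of Algorithm 1, assuming a small augmenting-path helper for the Maximum Size Matching; the similarity update is a stub standing in for equation 1, and alpha and the data layout are illustrative assumptions rather than the disclosed implementation:

# Maintain one two-part graph per candidate video; when a query's results
# arrive, add edges, recompute the MSM, and update the score if it exceeds
# the threshold alpha.
from collections import defaultdict

def maximum_matching(edges, queries):
    """Augmenting-path maximum matching; edges maps a query node to the
    set of source key frames it is connected to."""
    match = {}
    def try_assign(q, seen):
        for s in edges.get(q, ()):
            if s in seen:
                continue
            seen.add(s)
            if s not in match or try_assign(match[s], seen):
                match[s] = q
                return True
        return False
    return sum(try_assign(q, set()) for q in queries)

class ProgressiveQuery:
    def __init__(self, alpha=2):
        self.alpha = alpha
        self.edges = defaultdict(lambda: defaultdict(set))  # video -> q -> {s}
        self.scores = {}

    def add_query_result(self, q_k, returned):
        """returned: list of (video_id, key_frame) similar to query q_k."""
        touched = set()
        for video_id, key_frame in returned:
            self.edges[video_id][q_k].add(key_frame)
            touched.add(video_id)
        for video_id in touched:
            graph = self.edges[video_id]
            msm = maximum_matching(graph, list(graph))
            if msm > self.alpha:
                # Stand-in for W_i = Sim(Q, V_i, W_i^a, W_i^v) of equation 1.
                self.scores[video_id] = msm
        return sorted(self.scores, key=self.scores.get, reverse=True)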

Example Operation

FIGS. 9-11 illustrate example processes for implementing aspects of mobile video search of a LAVE indexed dataset as described herein. These processes are illustrated as collections of blocks in logical flow graphs, which represent a sequence of operations that can be implemented in hardware, software, or a combination thereof. In the context of software, the blocks represent computer-executable instructions on one or more computer-readable media that, when executed by one or more processors, cause the processors to perform the recited operations.

This acknowledges that software can be a valuable, separately tradable commodity. It is intended to encompass software, which runs on or controls “dumb” or standard hardware, to carry out the desired functions. It is also intended to encompass software which “describes” or defines the configuration of hardware, such as HDL (hardware description language) software, as is used for designing silicon chips, or for configuring universal programmable chips, to carry out desired functions.

Note that the order in which the processes are described is not intended to be construed as a limitation, and any number of the described process blocks can be combined in any order to implement the processes, or alternate processes. Additionally, individual blocks may be deleted from the processes without departing from the spirit and scope of the subject matter described herein. Furthermore, while the processes are described with reference to the mobile device 304 and server 306 described above with reference to FIGS. 1-8, in some embodiments other computer architectures including other cloud-based architectures as described above may implement one or more portions of these processes, in whole or in part.

FIG. 9 illustrates an example process 900 for implementing a mobile video search tool on a client device such as device 304. Although process 900 is described as being performed on a client device, in some embodiments a system including a client device and a server, which may include multiple devices in a network-based or cloud configuration as described above, can perform aspects of process 900.

Aspects of a mobile video search tool as described herein can be implemented as a search application running on the mobile device and/or via an application programming interface (API) in some embodiments. The mobile video search tool can capture the video input for query and perform extraction of the audio fingerprint and visual hash bits to form the audio-video signature. In the case of an application running on the mobile device, the application can send the audio-video signature as the video search query. In the case of an API, the application can expose the audio fingerprint and visual hash bits making up the audio-video signature via an API for another application to use for video search. In that case, the application accessing the API for video search can send the audio-video signature as the video search query.

At block 902, a device such as device 304 configured to receive video content as input via a video search tool, such as mobile video search tool 316, receives video content as input. In various embodiments receiving video content as input includes one or more input devices or components such as a microphone 410 and/or a camera 408 associated with device 304 capturing audio input from the video content via the microphone and/or capturing visual input from the video content via the camera in time slices. In some embodiments receiving video content as input includes receiving audio input and/or visual input associated with the video content as exposed via an API. In several embodiments, the time slices of video content are received by input devices associated with the device from a video output device not associated with the device. In various embodiments, a length of individual ones of the time slices includes at least about 0.1 second and at most about 10.0 seconds. In at least one embodiment, each time slice can represent one second of video content.

At block 904, the device, such as device 304, configured to extract an audio-video descriptor for a time slice of the video content via an audio-video extractor, such as one or more of an audio extraction module 426 and/or a video extraction module 428, performs extraction of an audio-video descriptor for a time slice of the video content. In various embodiments extracting audio-video descriptors for the time slices of video content includes obtaining aural and/or visual characteristics of the video content corresponding to the time slice.

In some embodiments, at block 906 the device, such as device 304, configured to extract aural characteristics for a time slice of the video content via an audio extraction module, such as audio extraction module 426, performs extraction of an audio fingerprint of the video content corresponding to the time slice for use in generating an audio-video signature.

In some embodiments, at block 908 the device, such as device 304, configured to extract visual characteristics for a time slice of the video content via a video extraction module, such as video extraction module 428, performs extraction of at least one visual hash bit of the video content corresponding to the time slice for use in generating an audio-video signature.

At block 910, the device, such as device 304, configured to generate an audio-video signature via a signature generator, such as signature module 430, generates an audio-video signature associated with one or more of the time slices of video content based at least in part on the audio-video descriptor having been extracted. In several embodiments, the audio-video signature includes at least an audio fingerprint and a video hash bit associated with a time slice of video content. In various embodiments, generation of an audio-video signature on the device can be performed by an application, and the generated audio-video signature can be used by the application for search or provided from the application by an API. In some embodiments, generation of an audio-video signature on the device can include an API providing raw descriptor extractions from which another application, which can be on or off the device, can generate the audio-video signature.

At block 912, the device, such as device 304, configured to provide an audio-video signature via a signature module, such as signature module 430, provides an audio-video signature associated with one or more of the time slices of video content, generated based at least in part on the audio-video descriptor having been extracted, as a query. In various embodiments providing the audio-video signature includes sending the audio-video signature as a query toward a dataset. In various embodiments, the dataset includes a layered audio-video indexed dataset.

At block 914, the device, such as device 304, configured to receive candidate results responsive to the query via a results module, such as results module 432, receives candidate results responsive to the query. In various embodiments receiving the candidate results responsive to the query includes receiving the candidate results as a progressive listing of candidate results before reaching an end of the time slices of video content being received.

At block 916, the device, such as device 304, configured to present candidate results responsive to the query via a user interface module, such as user interface module 434, causes candidate results to be presented. In various embodiments presenting the candidate results includes presenting the candidate results in a user interface of the device before reaching an end of the time slices of video content being received. In some embodiments presenting the candidate results includes presenting updated candidate results in the user interface of the device before reaching an end of the time slices of video content being received. Such updated candidate results can represent progressive candidate results for a progressive candidate results listing.
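A minimal sketch of the client-side flow of blocks 902-916: capture one time slice, extract the audio-video signature, send it as a query, and present progressive results until they stabilize. The extract_* and pack_* helpers are the illustrative functions sketched earlier in this document, and send_query, show_results, and the sample rate are placeholders standing in for network, UI, and capture code:

def mobile_video_search_loop(capture_slice, send_query, show_results,
                             max_slices=10):
    previous = None
    for _ in range(max_slices):
        audio, frame = capture_slice()            # one time slice (~1 second)
        landmarks = extract_landmarks(audio, 8000)
        _, visual = extract_visual_features(frame)
        signature = {'audio': pack_second(landmarks), 'visual': visual}
        candidates = send_query(signature)        # progressive candidates
        show_results(candidates)
        if candidates == previous:                # stable result: stop early
            return candidates
        previous = candidates
    return previous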

FIG. 10 illustrates an example process 1000 for implementing video search on a server, such as server 306, using a layered audio-video index, such as LAVE index 516.

Although process 1000 is described as being performed on a server, in some embodiments a system including one or more servers, which may include multiple devices in a network-based or cloud configuration as described above and in some instances at least one client device, can perform process 1000.

At block 1002, a device such as server 306 configured to receive a query audio-video signature as input via a layered audio-video engine, such as layered audio-video engine 328, receives a query audio-video signature as input. In various embodiments the query audio-video signature is received as input for a layered audio-video search. In some embodiments the query audio-video signature is received as input for a layered audio-video search from a mobile device such as device 304.

At block 1004, a device such as server 306 configured to search a layered audio-video index to identify entries having a similarity to the query audio-video signature, such as LAVE search module 528, performs a search of a layered audio-video index associated with the layered audio-video engine to identify entries in the layered audio-video index having a similarity to the query audio-video signature. In various embodiments the search identifies entries having a similarity to the query audio-video signature above a threshold. In various non-exclusive instances the threshold can include a predetermined similarity threshold, a variable similarity threshold, a relative similarity threshold, and/or a similarity threshold determined in real time.

At block 1006, a device such as server 306 configured to perform geometric verification of the entries having a similarity to the query audio-video signature, such as geometric verification module 530, performs geometric verification of entries from the layered audio-video index having similarity to the query audio-video signature. In various embodiments performing geometric verification includes performing geometric verification of respective key frames from the query audio-video signature and entries from the layered audio-video index having the similarity.

At block 1008, a device such as server 306 configured to send candidate results, such as decision module 534, sends candidate results that are similar to the query audio-video signature. In various embodiments sending candidate results identified via the geometric verification includes sending candidate results identified via the geometric verification toward the mobile device such as mobile device 304 from which the query audio-video signature was received.

FIG. 11 illustrates another example process 1100 for implementing video search on a server, such as server 306, using a layered audio-video index, such as LAVE index 516.

Although process 1100 is described as being performed on a server, in some embodiments a system including one or more servers, which may include multiple devices in a network-based or cloud configuration as described above and in some instances at least one client device, can perform process 1100.

At blocks 1102, 1104, and 1106, a device such as server 306 configured as described regarding process 1000, such as with layered audio-video engine 328, performs operations corresponding to blocks 1002, 1004, and 1006, respectively.

At block 1108, a device such as server 306 configured to perform progressive processing, such as progressive query module 532, processes candidate results identified via the geometric verification. In various embodiments processing candidate results identified via the geometric verification includes progressively processing entries having respective audio-video signatures. In some embodiments, progressively processing entries having respective audio-video signatures includes employing two-part graph-based transformation and matching.

At block 1110, a device such as server 306 configured to send candidate results, such as decision module 534, sends candidate results according to the progressive processing. In various embodiments sending candidate results according to the progressive processing includes sending candidate results according to the progressive processing toward the mobile device such as mobile device 304 from which the query audio-video signature was received. In some embodiments, sending candidate results according to the progressive processing includes sending candidate results in a configuration to indicate the candidate results have been updated and searching will continue, such as at 112. In some embodiments, sending candidate results according to the progressive processing also includes sending stabilized candidate results in a configuration to indicate the candidate results have not been updated and searching will be ceased, such as at 200.

At block 1112, a device such as server 306 configured to send candidate results, such as decision module 534, determines whether the candidate results from the progressive processing are stable. In various embodiments determining whether the candidate results from the progressive processing are stable includes determining whether to update the candidate results based at least in part on whether the candidate results are maintained. In some embodiments, determining whether the candidate results from the progressive processing are stable includes determining whether the candidate results are stable for a period of time. In some embodiments, the period of time is measured in seconds. In some embodiments, the period of time is two seconds. In some embodiments, the period of time is three seconds. In some embodiments, the period of time is variable and/or relative to the number of times the progressive query process has been performed without ceasing the search.

In some embodiments, responsive to the candidate results being determined to be stable at block 1112, at block 1114, a device such as server 306 configured to end querying, such as decision module 534, ceases searching corresponding to the audio-video content. In various embodiments, when the candidate results are determined to be stable for a period of time at block 1112, ceasing searching includes ceasing the receiving, searching, performing, and processing corresponding to the audio-video content. In some embodiments, ceasing searching at block 1114 can include sending candidate results according to the progressive processing in a configuration to indicate the candidate results have not been updated and searching is being ceased, such as in the user interface of 200.

In some embodiments, responsive to the candidate results being determined not to be stable at block 1112, a device such as server 306 configured to end querying, such as decision module 534, continues searching. In various embodiments, when the candidate results are determined not to be stable for a period of time at block 1112, continuing searching includes returning flow to block 1102, which can include repeating the receiving, searching, performing, and processing corresponding to the audio-video content. In some embodiments, continuing searching by returning flow to block 1102 can include sending candidate results according to the progressive processing in a configuration to indicate whether the candidate results have been updated, such as in the user interface of 200.

Additional Examples of Embodiments

Embodiment A includes a method comprising: accessing a video dataset; performing audio-video descriptor extraction on respective videos from the video dataset; generating a series of audio-video signatures associated with time slices of the respective videos; and building a layered audio-video index in which the entries include the series of audio-video signatures.

Embodiment B includes a method comprising: extracting audio-video descriptors corresponding to individual videos in a video dataset; acquiring an audio index, the audio index including audio fingerprints from the audio-video descriptors; acquiring a visual index, the visual index including visual hash bits from the audio-video descriptors; creating a first layer including a multi-index by associating the audio index and at least a part of the visual index; creating a second layer including the visual index; and maintaining a time relationship between the multi-index of the first layer and the visual index of the second layer.

Embodiment C includes a method as described regarding embodiments A and/or B, wherein the at least a part of a visual index for creating a first layer includes a random selection of hash bits from a second layer.

Embodiment D includes a method as described regarding embodiments A, B, and/or C, further comprising refining a number of visual points to be searched in a second layer via an audio index.

Embodiment E includes a method comprising: receiving a query audio-video signature related to video content at a layered audio-video engine; searching a layered audio-video index associated with the layered audio-video engine to identify entries in the layered audio-video index having a similarity to the query audio-video signature above a threshold; performing geometric verification of respective key frames from the query audio-video signature and entries from the layered audio-video index having the similarity; and sending candidate results identified via the geometric verification.

Embodiment F includes a method comprising: receiving a query audio-video signature related to video content at a layered audio-video engine; searching a layered audio-video index associated with the layered audio-video engine to identify entries in the layered audio-video index having a similarity to the query audio-video signature above a threshold; performing geometric verification of respective key frames from the query audio-video signature and entries from the layered audio-video index having the similarity; progressively processing entries having respective audio-video signatures; determining whether the candidate results are stable; determining whether to update the candidate results based at least in part on whether the candidate results are maintained; sending candidate results identified in accordance with whether the candidate results are maintained; in an event the candidate results are not maintained for a predetermined period of time, repeating the receiving, searching, performing, and processing corresponding to the audio-video content; and in an event the candidate results are maintained for a predetermined period of time, ceasing the receiving, searching, performing, and processing corresponding to the audio-video content.

CONCLUSION

With the ever-increasing functionality and data access available through mobile devices, such devices can serve as personal Internet-surfing concierges that provide users with access to ever increasing amounts of data while on the go. By leveraging the computing resources made available by a mobile device as described herein, a mobile video search tool can effectively perform a video search without sending a clip of the video itself as the query.

Although a mobile video search system has been described in language specific to structural features and/or methodological operations, it is to be understood that the features and operations defined in the appended claims are not necessarily limited to the specific features or operations described. Rather, the specific features and operations are disclosed as example forms of implementing the claims.

1. (canceled)
 2. A method comprising: receiving, via an input component of a computing device, a plurality of time slices of video content; extracting audio-video descriptors for the time slices of video content, to obtain aural and visual characteristics of the video content corresponding to the time slice; generating an audio-video signature associated with one or more of the time slices of video content based at least in part on the audio-video descriptor having been extracted; providing the audio-video signature associated with the one or more time slices of video content as a query toward a dataset; receiving candidate results of the query before reaching an end of the time slices of video content before a time window allowed for the query lapses; and presenting at least some of the candidate results before reaching the end of the time slices of video content.
 3. A method as recited in claim 2, wherein the time slices of video content are received from a video output device not associated with the computing device.
 4. A method as recited in claim 2, wherein the time slices of video content are received directly or indirectly by at least one of a camera input device or a microphone input device associated with the computing device.
 5. A method as recited in claim 4, wherein the time slices of video content are received from a video output device not associated with the computing device.
 6. A method as recited in claim 2, wherein a length of individual ones of the plurality of time slices includes at least about 0.1 second and at most about 10.0 seconds.
 7. A method as recited in claim 2, wherein the dataset includes a layered audio-video indexed dataset.
 8. A method as recited in claim 2, wherein the audio-video signature includes an audio fingerprint and a video hash bit associated with the time slice of video content.
 9. A system configured to perform a method as recited in claim 2.
 10. A computer-readable medium having computer-executable instructions encoded thereon, the computer-executable instructions configured to, upon execution, program a device to perform a method as recited in claim 2.
 11. A mobile device configured to perform a method as recited in claim 2.
 12. A method of layered audio-video search comprising: receiving a query audio-video signature related to video content at a layered audio-video engine; searching a layered audio-video index associated with the layered audio-video engine to identify entries in the layered audio-video index having a similarity to the query audio-video signature above a threshold; performing geometric verification of respective key frames from the query audio-video signature and entries from the layered audio-video index having the similarity; and sending candidate results identified via the geometric verification until a window of time allowed for a query using the query audio-video signature has elapsed.
 13. A method as recited in claim 12, further comprising progressively processing entries having respective audio-video signatures.
 14. A method as recited in claim 13, wherein the progressively processing the entries having respective audio-video signatures includes employing two-part graph-based transformation and matching.
 15. A method as recited in claim 12, further comprising: determining whether the candidate results are stable; and determining whether to update the candidate results based at least in part on whether the candidate results are maintained.
 16. A computer-readable medium having computer-executable instructions encoded thereon, the computer-executable instructions configured to, upon execution, program a device to perform operations as recited in claim 12.
 17. A system configured to perform a method as recited in claim 12.
 18. A computing device configured to perform a method as recited in claim 12.
 19. A method of building a layered audio-video index comprising: extracting audio-video descriptors corresponding to individual videos in a video dataset until a time window allowed for completing a query using a multi-index has elapsed; acquiring an audio index, the audio index including audio fingerprints from the audio-video descriptors; acquiring a visual index, the visual index including visual hash bits from the audio-video descriptors; creating a first layer including the multi-index by associating the audio index and at least a part of the visual index; creating a second layer including the visual index; and maintaining a time relationship between the multi-index of the first layer and the visual index of the second layer.
 20. A method as claim 19 recites, wherein the at least a part of the visual index for creating the first layer includes a random selection of hash bits from the second layer.
 21. A method as claim 19 recites, further comprising refining the number of visual points to be searched in the second layer via the audio index.